Welcome back. As a reminder, we've got a dataset with my cycling data from last year merged and stored in an HDF5 store. Today we'll use pandas, seaborn, and matplotlib to do some exploratory data analysis.
In [2]:
%matplotlib inline
import os
import datetime
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [3]:
df = pd.read_hdf(os.path.join('data', 'cycle_store.h5'), key='merged')
df.head()
Out[3]:
The columns should be pretty obvious. The pair (ride_id, Time) forms a unique index for the DataFrame.
We may set them to be the MultiIndex later. Pace is just $\frac{time}{distance}$.
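For reference, here's a minimal sketch of what setting that index could look like. The toy frame and its values are made up for illustration, and the `Distance (miles)` label is an assumption about the raw column name; only `ride_id` and `Time` are taken from the text above.

```python
import pandas as pd

# Toy stand-in for the merged frame; the values are invented for illustration.
toy = pd.DataFrame({
    'ride_id': [0, 0, 1, 1],
    'Time': pd.to_datetime(['2013-08-01 07:07:10', '2013-08-01 07:07:17',
                            '2013-08-01 17:30:00', '2013-08-01 17:30:05']),
    'Distance (miles)': [0.0, 0.02, 0.0, 0.01],  # assumed column label
})

# The pair should identify each row exactly once.
assert not toy.duplicated(subset=['ride_id', 'Time']).any()

# verify_integrity=True raises a ValueError if the new index has duplicates.
indexed = toy.set_index(['ride_id', 'Time'], verify_integrity=True)
```

Checking for duplicates first is optional, since `verify_integrity=True` would raise anyway, but it makes the assumption explicit.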
Let's drop some of the redundant columns and do a bit of renaming.
In [4]:
df = df.drop(['Ride Time', 'Stopped Time', 'Pace', 'Average Pace'], axis=1)

def renamer(name):
    for char in ['(', ')']:
        name = name.replace(char, '')
    name = name.replace(' ', '_')
    name = name.lower()
    return name

df = df.rename(columns=renamer)
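As a quick sanity check, here's the same renaming logic applied to a handful of labels. The raw labels are hypothetical, guessed from the cleaned column names used in the rest of this post.

```python
def renamer(name):
    """Drop parentheses, replace spaces with underscores, and lower-case."""
    for char in ['(', ')']:
        name = name.replace(char, '')
    name = name.replace(' ', '_')
    return name.lower()

# Hypothetical raw labels; the real ones may differ.
raw = ['Time', 'Distance (miles)', 'Ride Time (secs)', 'ride_id']
cleaned = [renamer(label) for label in raw]
print(cleaned)  # ['time', 'distance_miles', 'ride_time_secs', 'ride_id']
```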
Using a snazzy new feature coming out in pandas 0.15, we can easily access the time attributes of the datetime columns under the .dt namespace.
In [5]:
sub = df[['time', 'distance_miles', 'ride_id']].copy()
sub['time'] = sub['time'].dt.time
sub = sub.sort_values('time')  # on pandas < 0.17 this was sub.sort(columns='time')
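To see what the accessor hands back, here's a small sketch on two made-up timestamps. `.dt.time` gives a Series of plain `datetime.time` objects, while `.dt.hour` gives integers.

```python
import datetime
import pandas as pd

# Two invented timestamps, one morning and one evening.
stamps = pd.Series(pd.to_datetime(['2013-08-01 07:07:10',
                                   '2013-08-01 17:30:00']))

times = stamps.dt.time  # Series of datetime.time objects (object dtype)
hours = stamps.dt.hour  # Series of integers

# datetime.time values compare against a datetime.time scalar.
before_noon = times < datetime.time(12)
```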
Let's break them into two parts, morning and afternoon.
In [6]:
morning = sub[sub.time < datetime.time(12)]
evening = sub[sub.time >= datetime.time(12)]
In [7]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(morning.time.values, morning.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_ylim(0, 8)
Out[7]:
In [8]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(evening.time.values, evening.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_ylim(0, 8)
Out[8]:
Fun. The horizontal extent of each trace is the length of time it took me to make the ride. I like this chart because it also conveys the start time of each ride. The plots suggest that the morning ride typically took longer, but we can verify that.
In [9]:
is_morning = df.time.dt.time < datetime.time(12, 0, 0)
ride_time = df.groupby(['ride_id', is_morning])['ride_time_secs'].agg('max')
mean_time = ride_time.groupby(level=1).mean().rename(
    index={True: 'morning', False: 'evening'})
mean_time / 60
Out[9]:
So the morning ride is typically shorter! But I think I know what's going on. Our earlier plots were misleading because the ranges of the two horizontal axes weren't identical. Always check the axes!
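One way to sidestep the axis problem entirely is to measure each ride from its own start, so morning and evening rides share a common scale. This is only a sketch on invented data, and `elapsed_min` is a hypothetical column name, not something in the real dataset.

```python
import pandas as pd

# Invented observations: one morning ride and one evening ride.
toy = pd.DataFrame({
    'ride_id': [0, 0, 0, 1, 1, 1],
    'time': pd.to_datetime(['2013-08-01 07:00', '2013-08-01 07:10',
                            '2013-08-01 07:25', '2013-08-01 17:00',
                            '2013-08-01 17:12', '2013-08-01 17:30']),
})

# Minutes since the first observation of each ride.
start = toy.groupby('ride_id')['time'].transform('min')
toy['elapsed_min'] = (toy['time'] - start).dt.total_seconds() / 60

# Total duration per ride, in minutes.
duration = toy.groupby('ride_id')['elapsed_min'].max()
```

Plotting distance against `elapsed_min` would put every ride on the same horizontal scale, at the cost of losing the start-time information the original charts convey.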
In [10]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(morning.time.values, morning.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_xlim(datetime.time(6, 40), datetime.time(10, 40))
ax.set_ylim(0, 8)
Out[10]:
In [11]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(evening.time.values, evening.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_xlim(datetime.time(15, 30), datetime.time(19, 30))
ax.set_ylim(0, 8)
Out[11]:
In [12]:
axes = df.groupby('ride_id').plot(x='distance_miles', y='elevation_feet', color='k', alpha=.5)
In [13]:
df['is_morning'] = df.time.dt.time < datetime.time(12)
In [14]:
sns.tsplot(df, time="distance_miles", unit="ride_id", value="elevation_feet")
In [ ]:
df.groupby('ride_id').plot(x='distance_miles', y='elevation_feet')
In [ ]:
sns.tsplot(df, time="ride_time_secs", value="distance_miles", condition="is_morning", unit="ride_id")